31 research outputs found

    Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code

    This paper introduces Tiramisu, a polyhedral framework designed to generate high-performance code for multiple platforms, including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel extensions to explicitly manage the complexities that arise when targeting these systems. The framework is designed for the areas of image processing, stencils, linear algebra, and deep learning. Tiramisu has two main features: it relies on a flexible representation based on the polyhedral model, and it has a rich scheduling language allowing fine-grained control of optimizations. Tiramisu uses a four-level intermediate representation that allows full separation between the algorithm, loop transformations, data layout, and communication. This separation simplifies targeting multiple hardware architectures with the same algorithm. We evaluate Tiramisu by writing a set of image processing, deep learning, and linear algebra benchmarks and comparing them with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu matches or outperforms existing compilers and libraries on different hardware architectures, including multicore CPUs, GPUs, and distributed machines.
    Comment: arXiv admin note: substantial text overlap with arXiv:1803.0041
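
    To make the algorithm/schedule separation concrete, here is a minimal sketch in plain C++, not Tiramisu's actual scheduling API; all function and parameter names are illustrative assumptions. The same matrix-multiply algorithm is written once, and a tiled variant shows the kind of loop-level decision a scheduling language expresses separately from the algorithmic statement.

```cpp
// Sketch of the algorithm/schedule separation idea in plain C++.
// Names and structure are hypothetical, not Tiramisu identifiers.
#include <algorithm>
#include <cstddef>
#include <vector>

// "Algorithm": what to compute (C = A * B), with no ordering decisions.
void matmul_naive(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < N; ++k)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

// "Schedule": the same computation after tiling with tile size T -- the kind
// of transformation a scheduling language applies without touching the
// algorithm above.
void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t N, std::size_t T) {
    for (std::size_t ii = 0; ii < N; ii += T)
        for (std::size_t jj = 0; jj < N; jj += T)
            for (std::size_t kk = 0; kk < N; kk += T)
                for (std::size_t i = ii; i < std::min(ii + T, N); ++i)
                    for (std::size_t j = jj; j < std::min(jj + T, N); ++j)
                        for (std::size_t k = kk; k < std::min(kk + T, N); ++k)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```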

    A Framework for Customizable FPGA-based Image Registration Accelerators

    Image Registration is a highly compute-intensive optimization procedure that determines the geometric transformation to align a floating image to a reference one. Generally, the registration targets are images taken at different time instances, from different acquisition angles, and/or with different sensor types. Several methodologies are employed in the literature to address the limiting factors of this class of algorithms, among which hardware accelerators seem the most promising solution to boost performance. However, most hardware implementations are either closed-source or tailored to a specific context, limiting their application to different fields. For these reasons, we propose an open-source hardware-software framework to generate a configurable architecture for the most compute-intensive part of registration algorithms, namely the similarity metric computation. This metric is Mutual Information, a well-known measure from Information Theory used in several optimization procedures. Through different design parameter configurations, we explore several design choices of our highly customizable architecture and validate it on multiple FPGAs. We evaluated various architectures against an optimized Matlab implementation on an Intel Xeon Gold, reaching a speedup of up to 2.86x, as well as remarkable performance and power efficiency compared to other state-of-the-art approaches.
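
    For reference, here is a minimal software sketch of the Mutual Information kernel the accelerator targets: build a joint histogram of intensity pairs, then accumulate I(X;Y) = sum_{x,y} p(x,y) log(p(x,y) / (p(x) p(y))). The 256-bin count and 8-bit grayscale layout are simplifying assumptions, not details from the paper.

```cpp
// Software sketch of the Mutual Information similarity metric between a
// reference and a floating image (assumed equal-sized, 8-bit grayscale).
#include <cmath>
#include <cstdint>
#include <vector>

double mutual_information(const std::vector<uint8_t>& ref,
                          const std::vector<uint8_t>& flt) {
    constexpr int BINS = 256;
    std::vector<double> joint(BINS * BINS, 0.0), px(BINS, 0.0), py(BINS, 0.0);

    // Joint histogram of (reference, floating) intensity pairs.
    for (std::size_t i = 0; i < ref.size(); ++i)
        joint[ref[i] * BINS + flt[i]] += 1.0;

    // Normalize to probabilities and compute the marginals.
    const double n = static_cast<double>(ref.size());
    for (int x = 0; x < BINS; ++x)
        for (int y = 0; y < BINS; ++y) {
            joint[x * BINS + y] /= n;
            px[x] += joint[x * BINS + y];
            py[y] += joint[x * BINS + y];
        }

    // Accumulate the MI sum, skipping empty bins (p log p -> 0 as p -> 0).
    double mi = 0.0;
    for (int x = 0; x < BINS; ++x)
        for (int y = 0; y < BINS; ++y) {
            const double pxy = joint[x * BINS + y];
            if (pxy > 0.0)
                mi += pxy * std::log(pxy / (px[x] * py[y]));
        }
    return mi;  // in nats; divide by log(2) for bits
}
```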

    CICERO: A Domain-Specific Architecture for Efficient Regular Expression Matching

    Regular Expression (RE) matching is a computational kernel used in several applications. Since RE complexity and data volumes are steadily increasing, hardware acceleration is gaining attention for this problem as well. Existing approaches have limited flexibility, as they require a different implementation for each RE. On the other hand, it is complex to map efficient RE representations such as non-deterministic finite-state automata onto software-programmable engines or parallel architectures. In this work, we present CICERO, an end-to-end framework composed of a domain-specific architecture and a companion compilation framework for RE matching. Our solution is suitable for many applications, such as genomics/proteomics and natural language processing. CICERO aims at exploiting the intrinsic parallelism of non-deterministic representations of the REs. Thanks to its programmable architecture and compilation framework, CICERO can trade off accelerator efficiency against processor flexibility. We implemented CICERO prototypes on an embedded FPGA, achieving up to 28.6× and 20.8× higher energy efficiency than embedded and mainstream processors, respectively. Since it is a programmable architecture, it can also be implemented as a custom ASIC that is orders of magnitude more energy-efficient than mainstream processors.
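
    The intrinsic parallelism mentioned above comes from simulating the NFA by keeping a *set* of active states: every active state consumes each input character simultaneously. The sketch below shows this in software under assumed data structures (the transition-table layout and the '\0' epsilon marker are illustrative, not CICERO's actual encoding); in hardware, the per-state step is what runs on parallel engines.

```cpp
// Software sketch of set-of-active-states NFA simulation, the kernel a
// parallel RE-matching architecture accelerates.
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Nfa {
    // transitions[state] = list of (symbol, next_state); '\0' marks epsilon.
    std::vector<std::vector<std::pair<char, int>>> transitions;
    int start;
    std::set<int> accepting;
};

static void epsilon_closure(const Nfa& nfa, std::set<int>& states) {
    // Repeatedly add states reachable through epsilon edges.
    bool changed = true;
    while (changed) {
        changed = false;
        for (int s : std::set<int>(states))
            for (auto [sym, nxt] : nfa.transitions[s])
                if (sym == '\0' && states.insert(nxt).second)
                    changed = true;
    }
}

bool nfa_match(const Nfa& nfa, const std::string& input) {
    std::set<int> active{nfa.start};
    epsilon_closure(nfa, active);
    for (char c : input) {
        std::set<int> next;
        // Every active state consumes c; in hardware these per-state steps
        // proceed in parallel instead of sequentially.
        for (int s : active)
            for (auto [sym, nxt] : nfa.transitions[s])
                if (sym == c) next.insert(nxt);
        epsilon_closure(nfa, next);
        active = std::move(next);
    }
    for (int s : active)
        if (nfa.accepting.count(s)) return true;
    return false;
}
```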

    Large Forests and Where to “Partially” Fit Them

    The Artificial Intelligence of Things (AIoT) calls for on-site Machine Learning inference to overcome the instability in latency and availability of networks. Thus, hardware acceleration is paramount for reaching the Cloud's modeling performance within an embedded device's resources. In this paper, we propose Entree, the first automatic design flow for deploying the inference of Decision Tree (DT) ensembles on Field-Programmable Gate Arrays (FPGAs) at the network's edge. It exploits dynamic partial reconfiguration on modern FPGA-enabled Systems-on-a-Chip (SoCs) to accelerate arbitrarily large DT ensembles at a latency a hundred times more stable than software alternatives. Moreover, given Entree's suitability for both hardware designers and non-hardware-savvy developers, we believe it has the potential to help data scientists develop a non-Cloud-centric AIoT.
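
    To fix ideas, here is a minimal software sketch of the DT-ensemble inference that such a design flow maps to hardware: each tree is a flat node array walked from the root, and the ensemble output is a majority vote. The node layout and voting scheme are illustrative assumptions, not Entree's actual implementation.

```cpp
// Software sketch of decision-tree ensemble inference.
#include <cstddef>
#include <map>
#include <vector>

struct Node {
    int feature;      // feature index to test; -1 marks a leaf
    float threshold;  // go left if x[feature] <= threshold
    int left, right;  // child indices within the same tree
    int label;        // class label, valid only at leaves
};

using Tree = std::vector<Node>;

int predict_tree(const Tree& tree, const std::vector<float>& x) {
    int n = 0;  // start at the root
    while (tree[n].feature != -1)
        n = (x[tree[n].feature] <= tree[n].threshold) ? tree[n].left
                                                      : tree[n].right;
    return tree[n].label;
}

int predict_ensemble(const std::vector<Tree>& forest,
                     const std::vector<float>& x) {
    // On an FPGA, ensembles too large for one bitstream can be swapped in
    // and out via dynamic partial reconfiguration; in software we just vote.
    std::map<int, int> votes;
    for (const Tree& t : forest) ++votes[predict_tree(t, x)];
    int best = -1, best_count = -1;
    for (auto [label, count] : votes)
        if (count > best_count) { best = label; best_count = count; }
    return best;
}
```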

    A highly scalable and efficient parallel design of N-body simulation on FPGA

    An N-body simulation models the evolution of a system composed of N particles, where each element is subject to a force arising from its interaction with all the other elements of the system. Usually, external physical forces, such as gravity, are involved as well. This methodology is widely used in fields ranging from astrophysics, where it is used to study the interaction of celestial objects, to molecular dynamics, where the bodies represent molecules. Despite its wide range of applicability, the algorithm has a high computational complexity that requires powerful and power-hungry computers. Acceleration on a reconfigurable device, such as an FPGA, benefits both performance and power consumption. In this work, we present a scalable, high-performance, and highly efficient implementation of an N-body simulation algorithm on FPGA. The final design outperforms both CPU and FPGA works from the state of the art in terms of pure performance by a factor of up to 10x, and high-end GPUs in terms of performance per watt by a factor of 1.84x.
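
    As a software reference for the accelerated kernel, the sketch below implements the all-pairs gravitational interaction that gives the algorithm its O(N^2) complexity: every particle accumulates the force from every other particle, then positions are advanced. The softening term, integrator, and data layout are simplifying assumptions, not details of the paper's design.

```cpp
// All-pairs N-body step (semi-implicit Euler), the O(N^2) kernel an FPGA
// design pipelines and replicates.
#include <cmath>
#include <cstddef>
#include <vector>

struct Particle {
    double x, y, z;     // position
    double vx, vy, vz;  // velocity
    double mass;
};

void nbody_step(std::vector<Particle>& p, double dt, double G = 6.674e-11,
                double softening = 1e-9) {
    const std::size_t n = p.size();
    for (std::size_t i = 0; i < n; ++i) {
        double ax = 0.0, ay = 0.0, az = 0.0;
        // Inner loop: the pairwise interaction; softening avoids division
        // by zero for nearly coincident particles.
        for (std::size_t j = 0; j < n; ++j) {
            if (i == j) continue;
            const double dx = p[j].x - p[i].x;
            const double dy = p[j].y - p[i].y;
            const double dz = p[j].z - p[i].z;
            const double r2 = dx * dx + dy * dy + dz * dz + softening;
            const double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
            ax += G * p[j].mass * dx * inv_r3;
            ay += G * p[j].mass * dy * inv_r3;
            az += G * p[j].mass * dz * inv_r3;
        }
        p[i].vx += ax * dt;
        p[i].vy += ay * dt;
        p[i].vz += az * dt;
    }
    // Update positions only after all forces for this step are applied.
    for (auto& q : p) {
        q.x += q.vx * dt;
        q.y += q.vy * dt;
        q.z += q.vz * dt;
    }
}
```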

    A pipelined and scalable dataflow implementation of convolutional neural networks on FPGA

    A Convolutional Neural Network (CNN) is a deep learning algorithm extended from the Artificial Neural Network (ANN) and widely used for image classification and recognition thanks to its invariance to distortions. The recent rapid growth of applications based on deep learning algorithms, especially in the context of Big Data analytics, has dramatically spurred both industrial and academic research into optimized implementations of CNNs on accelerators such as GPUs, FPGAs, and ASICs, as general-purpose processors can hardly meet the ever-increasing performance and energy-efficiency requirements. FPGAs in particular are one of the most attractive alternatives, as they allow the exploitation of the implicit parallelism of the algorithm and the acceleration of the different layers of a CNN with custom optimizations, while retaining extreme flexibility thanks to their reconfigurability. In this work, we propose a methodology to implement CNNs on FPGAs in a modular, scalable way. This is done by exploiting the dataflow pattern of convolutions, using an approach derived from previous work on the acceleration of Iterative Stencil Loops (ISLs), a computational pattern that shares some characteristics with convolutions. Furthermore, this approach allows the implementation of a high-level pipeline between the different network layers, increasing overall performance when the CNN processes batches of multiple images, as happens in real-life scenarios.
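
    To show why convolutions and stencils share a dataflow pattern, here is a minimal direct 2D convolution: each output element reads a fixed KxK window of the input, exactly the sliding-window access pattern of an iterative stencil loop. A single channel and "valid" (no-padding) output are simplifying assumptions for the sketch.

```cpp
// Direct single-channel 2D convolution ("valid" output), illustrating the
// stencil-like KxK window access pattern a dataflow design streams through
// line buffers.
#include <cstddef>
#include <vector>

std::vector<float> conv2d(const std::vector<float>& in, std::size_t H,
                          std::size_t W, const std::vector<float>& kernel,
                          std::size_t K) {
    const std::size_t outH = H - K + 1, outW = W - K + 1;
    std::vector<float> out(outH * outW, 0.0f);
    for (std::size_t r = 0; r < outH; ++r)
        for (std::size_t c = 0; c < outW; ++c) {
            float acc = 0.0f;
            // KxK window: the same neighborhood access as a stencil loop,
            // which is why ISL-derived acceleration techniques carry over.
            for (std::size_t kr = 0; kr < K; ++kr)
                for (std::size_t kc = 0; kc < K; ++kc)
                    acc += in[(r + kr) * W + (c + kc)] * kernel[kr * K + kc];
            out[r * outW + c] = acc;
        }
    return out;
}
```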